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Abstract 

Using  the  Environmental  Quality  Modeling  program 
FEMWATER  as  a test-bed  code,  27  percent  of  the  time 
needed  to  run  a given  groundwater  flow  application  on 
the  ERDC  Cray  XI  using  four  multistream  processors 
(MSPs)  was  spent  assembling  the  global  stiffness  matrix. 
This  poor  performance  is  because  the  above  code  cannot 
multistream  without  help.  The  technique  of  “coloring” 
the  elements  makes  it  possible  to  multistream  this  section 
of  the  code,  thus  taking  advantage  of  the  hardware 
capability  of  the  machine.  Coloring  for  assembling  the 
global  stiffness  matrix  involves  dividing  the  elements  into 
different  groups  such  that  no  node  point  touches  any 
elements  with  the  same  color.  This  paper  will  present  a 
simple  coloring  algorithm  in  FORTRAN  and  show  how  it 
was  implemented  into  FEMWATER  to  achieve 
multistreaming  on  the  ERDC  Cray  XL  It  will  then  give  a 
detailed  description  on  how  the  program  was  modified, 
what  compiler  options  were  used,  and  what  compiler 
directives  worked  best.  Finally,  timing  results  will  be 
given.  Some  programs  that  have  good  MPI  (or 
equivalent)  communication  are  better  suited  for  running 
in  the  single-streamed  processor  (SSP)  mode.  In  the  SSP 
mode,  coloring  of  the  elements  is  not  needed  for 
assembling  the  global  stiffness  matrix.  Timings  for 
running  in  the  SSP  mode  will  be  shown,  too. 

1.  Introduction 

Vectorizing  and  multistreaming  application  codes  on 
the  Cray  XI  is  essential  to  good  performance.  Because  of 
this,  bottlenecks  in  code  performance  can  occur  in 
surprising  places.  Sometimes  the  algorithm  is  just  not 
suited  for  the  XI.  Many  times,  however,  special 
techniques  and  algorithms  can  be  applied  to  remedy  the 
situation.  This  paper  illustrates  one  such  example  of 
where  “coloring”  the  finite  elements  into  separate  groups 
can  significantly  help.  Assembling  the  element  stiffness 
matrices  into  the  global  stiffness  matrix  originally  took  27 
percent  of  the  total  run  time.  This  part  of  the  program 


now  runs  eight  times  faster  when  using  the  coloring 
scheme  described  in  this  paper.  ! 

I 

2.  FEMWATER  j 

1 

FEMWATER'4,5'  is  a standard  Galerkin  finite 
element  program  for  flow  and  transport.  METIS12'  was 
used  to  partition  the  mesh.  A conjugate  gradient,  iterative 
solver  with  an  incomplete  lower-upper  preconditioner'1' 
was  used  to  solve  the  resulting  system  of  linear  equations 
obtained  from  a Picard  iteration  of  the  nonlinear 

equations.  The  ghost  node  updates  are  done  in  this  study 
using  MPI.  i 

i 

2.1.  Test  Problem. 


Figure  1 illustrates  the  top  view  of  a typical  three- 
dimensional  (3-D)  finite  element  mesh  for  la  remediation 
study.  Several  layers  are  used  to  model  the  soil  layers 
underneath  this  surface.  The  mesh  has  102,996  nodes  and 
187,902  3-D  prism  elements.  Runs  using]  two  and  four 
times  the  original  number  of  elements  are  also  done. 
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2.2.  Assembly  of  Global  Stiffness  Matrix. 


Figure  2 shows  the  original  assembly  process. 
Subroutine  fq468  computes  element  stiffness  matrix 
information.  The  i and  j loops  are  from  1 to  6, 
representing  prism  elements,  ni  is  the  node  or  row 
number  in  the  global  stiffness  matrix,  and  jj  is  the 
column  number  in  the  global  stiffness  matrix.  These 
indices  were  reversed  in  storage  to  have  stride  1 
computations.  Finally,  a search  for  jj  must  be  done 
inside  the  j loop.  Figure  3 shows  how  coloring  modifies 
the  basic  routine.  This  modified  version  is  described  in 
more  detail  below. 

3.  Coloring 

Coloring  for  assembling  the  global  stiffness  matrix 
involves  dividing  the  elements  into  different  groups  such 
that  no  node  point  touches  any  elements  with  the  same 
color.  Figure  4 illustrates  the  process  of  generating  the 
different  groups  through  coloring.  The  algorithm 
presented  in  this  paper  written  in  FORTRAN  (see  Figure 
5)  is  very  similar  to  that  discussed  in  Reference  3. 
Starting  with  color  1 (blue),  an  element  is  first  painted. 
Then  all  elements  that  touch  this  element  are  marked  for 
not  allowing  blue.  A second  blue  element  is  done,  again 
marking  elements  touching  the  new  blue  element  as  not 
allowing  blue.  After  all  the  possible  blue  elements  have 
been  painted,  this  procedure  is  repeated  for  the  second 
color  (red).  This  algorithm  is  completed  when  all 
elements  have  been  provided  a color.  Finally,  all 
elements  that  contain  the  same  color  belong  to  the  same 
group. 

The  code  in  Figure  2 can  now  be  changed  to  the  code 
in  Figure  3.  The  do  loop  over  mm  can  now  be 
multistreamed  and  vectorized.  This  will  now  be 
described. 


do  m = 

1 , nel 

call 

fq468  (m) 

do  i 

= 1,  6 

ni 

= iem(i) 

do 

j = 1,  6 

I c Search  here  to  find  the 

column  number  jj 

for  the 

row  ni  . 

global_stiff ( j j , ni 

) = 

global_ 

stiffljj,  ni)  + 

& 

element_stif f ( i , 

j) 

end  do 

end 

do 

end  do 

Figure  2.  Original  assembly  process  before  coloring 


Figure  3.  Modified  assembly  process  after  coloring 


Figure  4.  Coloring  the  elements 


C Initialize  colors  of  all  elements  to 
zero . 

icolor(l  : no_of_eements ) = 0 
i = 0 
iquit  = 0 

do  while  ((i  .It.  max_no_colors ) .and. 

( iquit  . eq.  0 ) ) 

i = i + 1 

c Go  through  all  node  points  to  try  to 
color  elements  with  i. 

do  n = 1,  no_node_points 
mfound  = 0 
do  m = 1 , 

no_elements_connected_to_node_n 

if  {an  element  has  color  i}  mfound  = 

1 

end  do 

if  (mfound  .eq.  0)  then 
iel_f irst_zero  = 0 
do  m = 1 , 

no_elements_connected_to_node_n 

{If  element  m has  icolor  = 0,  set 
iel_f irst_zero  = m} 
end  do 

if  ( iel_f irst_zero  .ne.  0)  then 
c Set  element  to  color  i here. 

icolor (iel_first_zero)  = i 
c Set  the  color  of  all  elements  touching 
iel_f irst_zero  to  -1 

c if  no  color  has  been  assigned  yet. 
do  j = 1 , 

no_nodes_of_iel_f irst_zero 
do  k = 1 , 

no_elements_connected_to_node_j 
iel_k  = 

kth_element_connected_to_node_n 

if  ( icolor ( iel_k)  .eq.  0) 
icolor ( iel_k)  = - 1 
end  do 
end  do 
end  if 
end  if 
end  do 

c Clean  up  -l's  for  the  next  color  and 
check  for  termination, 
iquit  = 1 

do  n = 1 , no_of_elements 

if  (icolor(n)  .eq.  -1)  then 
icolor(n)  = 0 
iquit  = 0 
end  if 
end  do 

end  do 


4.  Multistreaming  and  Vectorizing 

I 

Space  allows  a brief  description  ofj  what  was  then 
done  to  achieve  vectorization  and  multistreaming  of  the 
code.  Figure  6 shows  a portion  of  the  new  algorithm  with 
compiler  notation  provided.  This  listing  was  generated 
using  the  command,  j 


103.  1- 

105.  1 

106.  1 


-<  do  ig  = 1 , nog! 

i 

mml  = istg(ig) 

mm2  = istg(ig  +1)  - 1 


108.  1 !csd$  parallel  do  private 

(mm,  m,  node,  mtyp. 


109.  1 
ni,  jq,  nj) 

115.  1 M 


117  . 
118. 
120. 
121. 
122. 

124. 

125. 

126. 


M 

M 

M 

M 

M 

M MVs < 

M MVs 
M MVs > 


!csd$&  alp,  por,  iq,  iem. 


do  mm  = mml , I mm2 

I 

m = ielg(mm) 

NODE  = IJNOD(M) 

MTYP  = IE (M,  9) 

ALP  = PROPF ( 7 , MTYP) 
POR  = PROPF (8,  MTYP) 
DO  IQ  = 1 , j NODE 

IEM ( IQ)  = IE (M , IQ) 
END  DO  | 


Intermediate  computations. 


M ! dir$ 

M MV < 

M MV 
M MV 


276.  1 

277.  1 

278.  1 

279.  1 
RQ(IQ) 

280.  1 M MV  Mr-< 

281.  1 M MV  Mr 

2 82.  1 M MV  Mr 

QA(IQ,  JQ)  * DELTI 

285.  1 M MV  Mr 
+ (QA ( IQ,  JQ)- 

286 . 1 M MV  Mr  & 

* HP (NJ) 

298.  1 M MV  Mr 
m)  , ni)  = 

2 99.  1 M MV  Mr 

jq,  m) , ni)  + 

300.  1 M MV  Mr  l 

* qb ( iq,  jq) 

301 . 1 M MV  Mr-> 

302  . 1 M MV > 


concurrent 
DO  IQ  = 1 , ] NODE 
NI  = IEM'(IQ) 

RLD(NI) j = RLD(NI)  + 

I 

DO  JQ  = 1,  NODE 
NJ  = IEM ( JQ) 

QA ( IQ,  JQ)  = 

I 

i 

RLD(NI)  = RLD(NI) 

I 

W2  *i  QB  ( IQ,  JQ)  ) 

i 

cmatrx ( iw(iq,  jq, 

cmatrx(iw(iq, 

1 

qa(iq,  jq)  + wl 


END  DO1 
END  DO  ' 


305.  1 M > 

306.  1 


end  do 


I 


!csd$  end  parallel  do 


I 


310.  1 > end  do 


Figure  6.  Assembly  process  showing  compiler 
notations  ! 


Figure  5.  Coloring  algorithm  in  FORTRAN 


5.  Performance  Results 


ftn  -c  -03, aggress  -rm  fasemb_xl . f 

The  meaning  of  the  notations  are  as  follows: 
M-Multistreamed  r-unrolled 

V-Vectorized  s-short  loop 

The  changes  to  the  original  code  are  as  follows: 

1.  Change  loops  to  use  the  coloring  algorithm 
(Figure  3). 

2.  Add  compiler  directives  (lines  108-109,  276, 
306  of  Figure  6).  Sometimes  the  Cray 
streaming  directives  (csd)  work,  and 
sometimes  the  !dir  concurrent  works. 
In  this  case,  using  both  of  them  was  required. 

3.  The  search  to  find  the  column  number  j j for 
the  row  ni  is  a bottleneck.  Therefore,  a one- 
time calculation  was  made  to  avoid  this  search 
(variable  iw  in  lines  298-300  of  Figure  6). 

4.  Manually  inline  subroutine  fq468. 

5.  Remove  error  print  statements  until  after  the 
important  loops. 

An  alternate  way  of  dealing  with  these  loops  is  to  get 
them  to  unroll  as  discussed  in  Reference  3.  To  test  this, 
the  loops  were  manually  unrolled  creating  hundreds  of 
lines  as  illustrated  in  Figure  7. 


I 108 

1 

csd$  parallel  do 

private 

(mm, 

m,  node,  mtyp, 

109 

1 

csd$&  alp, 

por , 

iq,  iem, 

ni , 

jq- 

nj) 

115 

1 

MV-< 

do  mm  = mml , mm2 

117 

1 

MV 

m = ielg 

mm) 

127 

1 

MV 

IEM(l)  = 

IE  (M, 

i) 

128 

1 

MV 

IEM ( 2 ) = 

IE  (M, 

2) 

129 

1 

MV 

IEM ( 3 ) = 

IE  (M, 

3) 

130 

1 

MV 

IEM ( 4 ) = 

IE  (M, 

4) 

305 

1 

MV-> 

end  do 

306 

1 

! csd$  end  parallel  do 

Figure  7.  Assembly  process  using  unrolling 


The  timings  were  the  same  as  that  achieved  from  the 
way  Figure  6 shows  it;  therefore,  the  results  in  Figure  6 
represent  the  best  effort  for  these  loops. 


Table  1 shows  timing  results  for  the  original  mesh, 
twice  the  size  of  the  original  mesh,  eight  times  the  size  of 
the  original  mesh,  and  sixteen  times  the  size  of  the 
original  mesh.  For  each  problem,  the  modified  MSP 
version  of  the  assembly  process  was  sped  up 
approximately  eight  times.  The  problems  were  also  run 
using  16  SSPs.  Despite  the  good  MPI  communication, 
the  modified  MSP  version  runs  four  times  faster  than  the 
original  SSP  version.  Coloring  of  elements  is  not  needed 
in  the  SSP  mode,  because  this  is  for  multistreaming  only. 


Table  1.  XI  timings  (sec)  for  4 MSPs  or  16  SSPs 
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